2  Grammar of Graphics

Author

Aaron R. Williams

2.1 The Tidy Approach

2.1.1 Opinionated software

Opinionated software is a software product that believes a certain way of approaching a business process is inherently better and provides software crafted around that approach. ~ Stuart Eccles

2.1.2 Tidy data

The defining opinion of the tidyverse is its wholehearted adoption of tidy data. Tidy data has three features:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a dataframe. (This is from the paper, not the book)

Source: R for Data Science

Tidy data was formalized by Hadley Wickham in “Tidy Data” in the Journal of Statistical Software in 2014. It is equivalent to Codd’s 3rd normal form (Codd, 1990) for relational databases.

Tidy datasets are all alike, but every messy dataset is messy in its own way. ~ Hadley Wickham

The tidy approach to data science is powerful because it breaks data work into two distinct parts.

  1. First, get the data into a tidy format.
  2. Second, use tools optimized for tidy data.

By standardizing the data structure for most community-created tools, the framework oriented diffuse development and reduced the friction of data work.
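As a minimal sketch of those two steps, the example below reshapes a small made-up table and then plots it. The `messy` table, its column names, and its values are invented for illustration and do not come from the workshop data.

```r
library(tidyverse)

# Hypothetical messy table: the variable "year" is spread across column names
messy <- tribble(
  ~unit, ~`2021`, ~`2022`,
  "a",   10,      12,
  "b",   7,       9
)

# Step 1: get the data into a tidy format (one row per unit-year observation)
tidy <- messy |>
  pivot_longer(
    cols = c(`2021`, `2022`),
    names_to = "year",
    values_to = "value"
  ) |>
  mutate(year = as.integer(year))

# Step 2: use a tool optimized for tidy data
tidy_plot <- ggplot(data = tidy, mapping = aes(x = year, y = value, color = unit)) +
  geom_line()
```

Printing `tidy_plot` draws one line per unit. Plotting the same data straight from `messy` would need a separate layer for each year column.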

2.2 Grammar of Graphics

ggplot2 is an R package for data visualization that was developed during Hadley Wickham’s graduate studies at Iowa State University. ggplot2 is formalized in “A Layered Grammar of Graphics” by Hadley Wickham, which was published in the Journal of Computational and Graphical Statistics in 2010.

The grammar of graphics, originally by Leland Wilkinson, is a theoretical framework that breaks all data visualizations into their component pieces. With the layered grammar of graphics, Wickham extends Wilkinson’s grammar of graphics and implements it in R. The result is impressively cohesive: the theory maps directly onto the code, and the code in turn guides the data visualization process in a way that few other data viz tools match.

There are eight main ingredients to the grammar of graphics. We will work our way through the ingredients with many hands-on examples.


Exercise 0

Step 1: Open 2024_asa-data-viz.Rproj.

Step 2: Open 02_workbook.Rmd.


Exercise 1

Step 1: Type (don’t copy & paste) the following code below library(tidyverse) in the new chunk where it says # YOUR WORK GOES HERE.

ggplot(data = storms) + 
  geom_point(mapping = aes(x = pressure, y = wind))

Step 2: Add a comment above the ggplot2 code that describes the plot we created.

Step 3: As we progress, add comments below the data visualization code that describe the argument or function that corresponds to each of the first three components of the grammar of graphics.


Tip

1 Data are the values represented in the visualization.

ggplot(data = ) or data %>% ggplot()

storms %>%
  select(name, year, category, lat, long, wind, pressure)
# A tibble: 19,537 × 7
   name   year category   lat  long  wind pressure
   <chr> <dbl>    <dbl> <dbl> <dbl> <int>    <int>
 1 Amy    1975       NA  27.5 -79      25     1013
 2 Amy    1975       NA  28.5 -79      25     1013
 3 Amy    1975       NA  29.5 -79      25     1013
 4 Amy    1975       NA  30.5 -79      25     1013
 5 Amy    1975       NA  31.5 -78.8    25     1012
 6 Amy    1975       NA  32.4 -78.7    25     1012
 7 Amy    1975       NA  33.3 -78      25     1011
 8 Amy    1975       NA  34   -77      30     1006
 9 Amy    1975       NA  34.4 -75.8    35     1004
10 Amy    1975       NA  34   -74.8    40     1002
# ℹ 19,527 more rows

Tip

2 Aesthetic mappings are directions for how variables from the data are mapped to visual elements in the data visualization. Aesthetic mappings show variation in the data through variation in the data visualization. Aesthetic mappings include linking variables to the x-position, y-position, color, fill, shape, transparency, and size.

aes(x = , y = , color = )

Examples of aesthetics: x or y position; color or fill; size; shape; and others such as transparency and line type.

Tip

3 Geometric objects are representations of the data, including points, lines, and polygons.

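Plots are often named after their geometric object(s). For example, a plot built with geom_point() is a scatter plot and a plot built with geom_line() is a line plot.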

geom_bar() or geom_col()

geom_line()

geom_point()


Exercise 2

Step 1: Duplicate the code from exercise 1. Inside aes(), add color = category. Run the code.

Step 2: Replace color = category with color = "green". Run the code. What changed? Is this unexpected?

Step 3: Remove color = "green" from aes() and add it inside of geom_point() but outside of aes(). Run the code.

Step 4: This is a little cluttered. Add alpha = 0.2 inside geom_point() but outside of aes().

Aesthetic mappings like x and y almost always vary with the data. Aesthetic mappings like color, fill, shape, transparency, and size can vary with the data. But those arguments can also be added as styles that don’t vary with the data. If you include those arguments in aes(), they will show up in the legend (which can be annoying! and is also a sign that something should be changed!).
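The difference between a mapped aesthetic and a constant style can be sketched as follows. The object names `mapped` and `constant` are ours. The code only restates the contrast explored in Exercise 2.

```r
library(tidyverse)

# Mapped: color sits inside aes(), so it varies with the data and gets a legend
mapped <- ggplot(data = storms) +
  geom_point(mapping = aes(x = pressure, y = wind, color = category))

# Constant: color and alpha sit outside aes(), so every point gets the same
# style and no legend is drawn
constant <- ggplot(data = storms) +
  geom_point(
    mapping = aes(x = pressure, y = wind),
    color = "green",
    alpha = 0.2
  )
```

Print either object to draw it. If "green" were placed inside aes(), ggplot2 would treat it as a one-value variable rather than as a color name.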


Exercise 3

Step 1: Create a new scatter plot using the msleep data set. Use bodywt on the x-axis and sleep_total on the y-axis.

Step 2: The y-axis doesn’t contain zero. Below geom_point(), add scale_y_continuous(limits = c(0, NA)). Hint: add + after geom_point().

Step 3: The x-axis is clustered near zero. Add scale_x_log10() above scale_y_continuous(limits = c(0, NA)).

Step 4: Add and run options(scipen = 999). Rerun the code from steps 1-3.

Tip

4 Scales control the exact behaviors of aesthetic mapping. scale_*_*() functions can change:

  • The range and labels on the x-axis and y-axis
  • The colors used for color and fill
  • The sizes of shapes
  • Shapes

There are dozens of scale functions and their names follow a formula:

  • They all start with scale_.
  • Next comes the name of the aesthetic for the scale (e.g. x, y, fill, size).
  • Finally comes the type of variable or transformation (e.g. discrete, continuous, or reverse).

scale_x_continuous() and scale_y_continuous() are two popular scale_*_*() functions.

Before

scale_x_continuous()

After

scale_x_reverse()

Before

scale_size_continuous(breaks = c(25, 75, 125))

After

scale_size_continuous(range = c(0.5, 20), breaks = c(25, 75, 125))
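To make the naming formula concrete, here is a small sketch that puts three scale functions on one plot. Using msleep with vore mapped to color, and the "Dark2" palette, are our choices for illustration.

```r
library(tidyverse)

scaled_plot <- ggplot(
  data = msleep,
  mapping = aes(x = bodywt, y = sleep_total, color = vore)
) +
  geom_point() +
  # scale_ + x + log10: log-transform the x-axis
  scale_x_log10() +
  # scale_ + y + continuous: start the y-axis at zero, let ggplot2 pick the top
  scale_y_continuous(limits = c(0, NA)) +
  # scale_ + color + brewer: choose the palette for a discrete color mapping
  scale_color_brewer(palette = "Dark2")
```

Each function follows the scale_<aesthetic>_<type> pattern, so it is usually possible to guess the name of the scale function you need.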


Tip

5 Coordinate systems map scaled geometric objects to the position of objects on the plane of a plot. The two most popular coordinate systems are the Cartesian coordinate system and the polar coordinate system.

coord_polar()

Exercise 4

Step 1: Reggee sleeps at least 14 hours per day. Run the following code to make a stacked bar chart.

adequate_sleep <- msleep |>
  mutate(sleep14 = if_else(
    sleep_total > 14, 
    "Adequate sleep", 
    "Inadequate sleep")
  ) |>
  count(sleep14) |>
  mutate(prop = n / sum(n))

ggplot(data = adequate_sleep, aes(x = "", y = prop, fill = sleep14)) +
  geom_col()

Step 2: Add coord_polar(theta = "y") to make a pie chart.

Step 3: Add scale_fill_manual(values = c("Adequate sleep" = "green", "Inadequate sleep" = "red")) to change the colors of the sections.

Exercise 5

Step 1: Create a scatter plot of the storms data set with pressure on the x-axis and wind on the y-axis.

Step 2: Add facet_wrap(~ month)

Tip

6 Facets (optional) break data into meaningful subsets. Faceting functions include facet_wrap(), facet_grid(), and facet_geo() (from the geofacet package).

facet_wrap()

facet_wrap(~ category)

facet_grid()

facet_grid(month ~ year)
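The two faceting calls above can be sketched on a subset of storms. Filtering to 2004 and 2005 is our assumption, made only to keep the grid small enough to read.

```r
library(tidyverse)

# Assumption: two years are enough to show the layout
recent_storms <- storms |>
  filter(year %in% c(2004, 2005))

# facet_wrap(): one panel per month, wrapped into rows and columns
wrapped <- ggplot(data = recent_storms, mapping = aes(x = pressure, y = wind)) +
  geom_point(alpha = 0.2) +
  facet_wrap(~ month)

# facet_grid(): months form the rows and years form the columns
gridded <- ggplot(data = recent_storms, mapping = aes(x = pressure, y = wind)) +
  geom_point(alpha = 0.2) +
  facet_grid(month ~ year)
```

facet_wrap() suits a single variable with many levels. facet_grid() suits two variables whose combinations you want to compare along rows and columns.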


Exercise 6

Step 1: Add the following code to your script. Submit it!

ggplot(storms) +
  geom_bar(mapping = aes(x = category))

Tip

7 Statistical transformations (optional) transform the data, typically through summary statistics and functions, before aesthetic mapping.

Before transformations, each observation in the data is represented by one geometric object (e.g. a point in a scatter plot). After a transformation, a geometric object can represent more than one observation (e.g. a bar in a histogram).

Note: geom_bar() performs statistical transformation. Use geom_col() to create a column chart with bars that encode individual observations in the data set.
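One way to see the transformation is to do the counting by hand. The sketch below builds the Exercise 6 bar chart twice: once with geom_bar() doing the counting, and once with count() followed by geom_col(). Dropping storms with a missing category first is our choice, so that both versions summarize the same rows.

```r
library(tidyverse)

# Assumption: drop rows without a category so both plots use identical data
classified <- storms |>
  filter(!is.na(category))

# geom_bar() counts the rows in each category before drawing the bars
bar_plot <- ggplot(data = classified) +
  geom_bar(mapping = aes(x = category))

# The same chart by hand: summarize first, then draw one column per row
category_counts <- classified |>
  count(category)

col_plot <- ggplot(data = category_counts) +
  geom_col(mapping = aes(x = category, y = n))
```

The two plots should show bars of the same height. The only difference is whether ggplot2 or dplyr does the counting.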


Exercise 7

Step 1: Duplicate Exercise 6.

Step 2: Add theme_minimal() to the plot.

2.2.1 Themes

Tip

8 Theme controls the visual style of a plot, including font types, font sizes, background colors, margins, and positioning.

Default theme

Theme Minimal

fivethirtyeight theme

urbnthemes

If you prefer the minimal theme, you can add theme_minimal() to each visualization or add theme_set(theme_minimal()) at the beginning of your script. Note the parentheses: theme_set() expects a theme object, not the function itself.
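A short sketch of setting a session-wide theme follows. Saving and restoring the previous theme is optional and is shown only so that the example leaves the session unchanged.

```r
library(tidyverse)

# theme_set() returns the previous theme invisibly, so we can restore it later
old_theme <- theme_set(theme_minimal())

# Plots created from here on use theme_minimal() without adding it explicitly
minimal_bar <- ggplot(data = storms) +
  geom_bar(mapping = aes(x = category))

# Restore the earlier default
theme_set(old_theme)
```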


Exercise 8

Step 1: Add the following code to your script. Run it!

storms %>%  
  filter(category > 0) %>%
  distinct(name, year) %>%
  count(year) %>%
  ggplot() + 
  geom_line(mapping = aes(x = year, y = n))

Step 2: Add geom_point(mapping = aes(x = year, y = n)) after geom_line().

Tip

Layers allow for distinct geometric objects and/or distinct data sets to be combined in the same data visualization.
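As a sketch of a layer that uses its own data set, the example below extends the Exercise 8 plot with a third layer that highlights the busiest year. The `busiest_year` table and the red highlight are our additions for illustration.

```r
library(tidyverse)

storms_per_year <- storms |>
  filter(category > 0) |>
  distinct(name, year) |>
  count(year)

# A second, smaller data set used by only one layer
busiest_year <- storms_per_year |>
  slice_max(order_by = n, n = 1)

layered <- ggplot(data = storms_per_year, mapping = aes(x = year, y = n)) +
  geom_line() +
  geom_point() +
  # This layer overrides the plot data but inherits the x and y mappings
  geom_point(data = busiest_year, color = "red", size = 3)
```

Layers are drawn in the order they are added, so the red point sits on top of the line and the smaller points.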


Exercise 9

Step 1: Add the following code to your script. Run it!

ggplot(data = storms, mapping = aes(x = pressure, y = wind)) + 
  geom_point() +
  geom_smooth()

Tip

Inheritances pass aesthetic mappings from ggplot() to later geom_*() functions.

Notice how the aesthetic mappings are passed to ggplot() in exercise 9, so they do not need to be repeated in geom_point() and geom_smooth(). This is useful when using layers!
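To illustrate inheritance, the sketch below writes the Exercise 9 plot two ways. The object names `inherited` and `repeated` are ours.

```r
library(tidyverse)

# Mappings given to ggplot() are inherited by every geom_*() layer
inherited <- ggplot(data = storms, mapping = aes(x = pressure, y = wind)) +
  geom_point() +
  geom_smooth()

# Without inheritance, the same mappings must be repeated in each layer
repeated <- ggplot(data = storms) +
  geom_point(mapping = aes(x = pressure, y = wind)) +
  geom_smooth(mapping = aes(x = pressure, y = wind))
```

Both objects should draw the same plot. A layer can still add or override an inherited mapping by passing its own aes().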


Exercise 10

Step 1: Pick your favorite plot from exercises 1 through 9 and duplicate it in a new code chunk.

Step 2: Add ggsave(filename = "favorite-plot.png") and then look at the saved file.

Step 3: Add width = 6 and height = 4 to ggsave().
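A hedged sketch of the saving step follows. By default, ggsave() saves the last plot displayed. Passing plot = explicitly, as done here, is our choice to make the example independent of what was drawn last. Writing to a temporary directory is also an assumption, so the example does not clutter the project folder.

```r
library(tidyverse)

favorite_plot <- ggplot(data = storms, mapping = aes(x = pressure, y = wind)) +
  geom_point(alpha = 0.2)

# Assumption: save into a temporary directory instead of the working directory
out_file <- file.path(tempdir(), "favorite-plot.png")

# width and height are in inches by default
ggsave(filename = out_file, plot = favorite_plot, width = 6, height = 4)
```

Setting width and height explicitly makes the saved file reproducible, because the size no longer depends on the current plotting window.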


3 Review

3.0.1 Theory

  1. Data
  2. Aesthetic mappings
  3. Geometric objects
  4. Scales
  5. Coordinate systems
  6. Facets
  7. Statistical transformations
  8. Theme

3.0.2 Functions

  • ggplot()
  • aes()
  • geom_*()
    • geom_point()
    • geom_line()
    • geom_col()
  • scale_*_*()
    • scale_y_continuous()
  • coord_*()
  • facet_*()
  • labs()
  • ggsave()

3.1 Resources